| Statistic | Distance_K | Elapsed_Time_Minutes | Max_Heart_Rate | Average_Heart_Rate | Calories_Burned_Estimated | Moving_Time_Minutes | Weather_Temperature |
|---|---|---|---|---|---|---|---|
| Min. | 1.170 | 3.95 | 89.0 | 75.22 | 23.72 | 3.95 | 10.15 |
| 1st Qu. | 4.860 | 25.73 | 178.0 | 150.78 | 449.00 | 21.67 | 49.35 |
| Median | 5.200 | 29.87 | 182.0 | 157.52 | 504.09 | 24.20 | 62.91 |
| Mean | 7.572 | 37.68 | 179.8 | 154.81 | 602.40 | 31.38 | 62.73 |
| 3rd Qu. | 9.830 | 46.25 | 185.0 | 162.46 | 792.21 | 39.87 | 75.40 |
| Max. | 81.730 | 329.08 | 196.0 | 177.03 | 2825.00 | 217.08 | 100.51 |

| Activity_Type | Count |
|---|---|
| 1 | 436 |
| 0 | 61 |

Season: character column, length 497.
| Season | Count | freq |
|---|---|---|
| Summer | 196 | 0.394 |
| Spring | 132 | 0.266 |
| Winter | 87 | 0.175 |
| Fall | 82 | 0.165 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -399.673 | 108.169 | -3.695 | 0.000 |
| Distance_K | -34.960 | 2.139 | -16.340 | 0.000 |
| Elapsed_Time_Minutes | -2.813 | 0.565 | -4.977 | 0.000 |
| Max_Heart_Rate | 2.076 | 0.896 | 2.316 | 0.021 |
| SeasonSpring | 46.660 | 18.156 | 2.570 | 0.010 |
| SeasonSummer | -16.192 | 19.008 | -0.852 | 0.395 |
| SeasonWinter | -1.563 | 19.949 | -0.078 | 0.938 |
| Average_Heart_Rate | 0.779 | 0.816 | 0.955 | 0.340 |
| Moving_Time_Minutes | 27.538 | 0.990 | 27.827 | 0.000 |
| Weather_Temperature | 0.146 | 0.425 | 0.343 | 0.732 |
| .metric | .estimator | .estimate |
|---|---|---|
| rmse | standard | 123.878 |
| rsq | standard | 0.863 |
| mae | standard | 83.484 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 63.643 | 83694.593 | 0.001 | 0.999 |
| Distance_K | 35.598 | 14015.355 | 0.003 | 0.998 |
| Elapsed_Time_Minutes | 0.183 | 1201.187 | 0.000 | 1.000 |
| Max_Heart_Rate | -0.207 | 1342.719 | 0.000 | 1.000 |
| SeasonSpring | -17.671 | 40896.706 | 0.000 | 1.000 |
| SeasonSummer | -6.158 | 36926.171 | 0.000 | 1.000 |
| SeasonWinter | -1.402 | 29123.394 | 0.000 | 1.000 |
| Average_Heart_Rate | -0.292 | 1253.322 | 0.000 | 1.000 |
| Moving_Time_Minutes | -10.655 | 2560.028 | -0.004 | 0.997 |
| Weather_Temperature | 0.375 | 416.741 | 0.001 | 0.999 |
| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 1 |
| specificity | binary | 1 |
| sensitivity | binary | 1 |
---
title: "Project Part Two Dashboard"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: scroll
    source_code: embed
    theme: yeti
---
```{r setup, include=FALSE,warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(GGally) #v2.1.2
library(ggcorrplot) #v0.1.3
library(MASS) #v7.3-54 for Boston data
library(flexdashboard) #v0.5.2
library(plotly) #v4.10.1
library(crosstalk) #v1.2.0
library(tidymodels)
library(readr)
#library(dplyr) #v1.0.7 %>%, select(), select_if(), filter(), mutate(), group_by(),
#summarize(), tibble()
#library(ggplot2) #v3.3.5 ggplot()
```
```{r load_data}
#Load the data and encode the response: Run = 1, Ride = 0
df <- read_csv("Strava_df2.csv") %>%
  mutate(
    Activity_Type = ifelse(Activity_Type == "Run", 1, 0),
    #List 1 first so yardstick treats Run as the event level
    Activity_Type = factor(Activity_Type, levels = c(1, 0))
  )
```
Introduction {data-orientation=rows}
=======================================================================
Row {data-height=600}
-----------------------------------------------------------------------
### The Project
#### The Problem Description
This project examines my exercise activities, recorded on a smartwatch and uploaded to the Strava app. The goals are to predict which type of activity was recorded and how many calories were burned, using all other variables as predictors. I will first explore every variable to understand its distribution and scale. I will then run a regression analysis predicting calories burned, using linear and lasso regression along with other methods as needed. Next, I will run a classification analysis predicting whether an activity was a run or a ride, using logistic regression and classification trees. I will close with a conclusion that identifies my best models and describes the best predictors of both the continuous and the categorical response.
#### The Data
This dataset has 497 rows and 9 variables.
#### Data Sources
This dataset was recorded with a smartwatch and the Strava app. It combines biometric and GPS data that I recorded while going on a run or ride.
### The Variables
VARIABLES TO PREDICT WITH
* **Moving Time (Minutes)**: The total time, measured in minutes, that I was actually moving. This excludes time spent stopped, such as waiting at an intersection or for the runner’s dog to go #2.
* **Elapsed Time (Minutes)**: The total time, measured in minutes, that the activity was being recorded.
* **Average Heart Rate**: My average heart rate in beats per minute over the course of the activity, according to my smartwatch.
* **Weather Temperature**: Temperature in degrees Fahrenheit, supplied by The Weather Channel API for the location where the run or ride occurred.
* **Season**: Winter, Spring, Summer, or Fall. This was originally the activity date, which I converted into a season.
* **Max Heart Rate**: The highest heart rate, in beats per minute, measured during the activity.
* **Distance_K**: The distance covered during the activity, measured in kilometers.
VARIABLES WE WANT TO PREDICT
* **Activity Type**: Whether the activity is a Run (1) or a Bike Ride (0).
* **Calories_Burned_Estimated**: The number of calories burned, according to Strava’s estimates.
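
The Season and Activity Type encodings described above could be produced along the following lines. This is only a minimal base-R sketch: the helper `date_to_season()`, the column name `Activity_Date`, and the month cutoffs (meteorological seasons) are assumptions, since the original conversion rule is not shown, and the example rows are made up.

```r
# Hypothetical helper: map a date to a season using meteorological seasons
# (Dec-Feb = Winter, Mar-May = Spring, Jun-Aug = Summer, Sep-Nov = Fall).
# These cutoffs are an assumption, not the rule used to build the dataset.
date_to_season <- function(dates) {
  month_num <- as.integer(format(as.Date(dates), "%m"))
  season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring",
                       "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
                       "Winter")
  season_by_month[month_num]
}

# Made-up example rows, not taken from the Strava export
example <- data.frame(
  Activity_Date = as.Date(c("2022-01-15", "2022-04-02", "2022-07-20", "2022-10-31")),
  Activity_Type = c("Run", "Ride", "Run", "Run")
)
example$Season <- date_to_season(example$Activity_Date)

# Same encoding as the load_data chunk: Run = 1, Ride = 0, with 1 as the first level
example$Activity_Type <- factor(ifelse(example$Activity_Type == "Run", 1, 0),
                                levels = c(1, 0))
print(example)
```

Keeping the conversion in a small function makes it easy to swap in different cutoffs, for example astronomical seasons, if that is closer to how the dates were originally grouped.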
Data Exploration {data-orientation=rows}
=======================================================================
Column {.sidebar data-width=200}
-------------------------------------
### Data Overview
Calories burned has a very wide range, from about 24 to 2,825. Activity Type is mostly 1s (runs): 436 of the 497 activities. The season count table shows that most activities occurred in the Summer and Spring.
Column {data-width=450 data-height=600}
-----------------------------------------------------------------------
### View the Data Summaries
Summaries of each variable are below. Season is stored as text, so its breakdown is easier to read in the Count of Activity by Season table.
```{r, cache=TRUE}
#View data
summary(df)
```
Column {data-width=150 data-height=300}
-----------------------------------------------------------------------
### Count of Activity by Season
Most of our activities occurred in the Summer and Spring seasons.
```{r, cache=TRUE}
#Summary table for Season variable
knitr::kable(df %>%
group_by(Season) %>%
summarize(Count=n()) %>%
mutate(freq = round(Count / sum(Count), 3)) %>%
arrange(desc(freq)))
```
Data Visualization {data-orientation=rows}
=======================================================================
### Response Variables and Their Relationships with Predictors
* The vast majority of our activities were runs rather than rides.
* The histogram of calories burned is not normal: it has a long right tail, and the largest concentration of activities sits just below 500 calories. It might look more normal if we looked at runs and rides separately.
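
One way to check whether the right tail comes from mixing runs and rides is to compare skewness for the pooled data against skewness within each activity type. The sketch below uses simulated values, not the Strava data: the group sizes match the dashboard (436 runs, 61 rides), but the means and spreads are invented, and `skewness()` is a hand-written helper rather than a library function.

```r
set.seed(1)

# Simulated stand-in data: group sizes match the dashboard,
# calorie values are invented for illustration only.
sim <- data.frame(
  Activity_Type = rep(c("Run", "Ride"), times = c(436, 61)),
  Calories = c(rnorm(436, mean = 500, sd = 80),
               rnorm(61, mean = 1300, sd = 400))
)

# Sample skewness: the third standardized moment
# (near 0 for a symmetric distribution, positive for a right tail)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_all <- skewness(sim$Calories)
skew_by_type <- tapply(sim$Calories, sim$Activity_Type, skewness)

# Pooled skewness next to the per-activity values
print(round(c(All = skew_all, skew_by_type), 2))
```

If the real data behave like this simulation, the pooled skewness would be clearly positive while each group on its own would be close to symmetric. Whether that holds for the actual activities would need to be checked on `df` itself.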
Row {data-height=550}
-----------------------------------------------------------------------
#### Activity Type
```{r, cache=TRUE}
ggplot(df,aes(x=Activity_Type)) + geom_bar()
```
#### Calories Burned
```{r, cache=TRUE}
ggplot(df, aes(Calories_Burned_Estimated)) + geom_histogram(bins=20)
```
Row {.tabset data-height=450}
-----------------------------------------------------------------------
### Calories Burned vs Season
```{r, cache=TRUE}
ggpairs(dplyr::select(df,Calories_Burned_Estimated,Season))
```
### Calories Burned vs Continuous Variables
```{r, cache=TRUE}
ggcorrplot(cor(dplyr::select(df,Calories_Burned_Estimated,Elapsed_Time_Minutes,Max_Heart_Rate,Average_Heart_Rate,Distance_K,Moving_Time_Minutes,Weather_Temperature)))
```
### Activity Type vs Continuous Variables
```{r, cache=TRUE}
ggpairs(dplyr::select(df,Activity_Type,Elapsed_Time_Minutes,Max_Heart_Rate,Average_Heart_Rate,Distance_K,Moving_Time_Minutes,Weather_Temperature))
```
### Activity Type vs Categorical Variables
```{r, cache=TRUE}
#Count activities per season and type, then plot grouped bars
df %>% group_by(Season, Activity_Type) %>%
  summarize(n = n(), .groups = "drop") %>% #drop grouping to avoid the regrouping message
  ggplot(aes(y = n, x = Activity_Type, fill = Season)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), position = position_dodge(width = 0.9), vjust = -0.25) +
  ggtitle("Activity Type vs Season") +
  coord_flip() #makes horizontal
```
Initial Models {data-orientation=rows}
=======================================================================
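### Predicting Calories Burned
Here is a linear regression model predicting Calories Burned from every predictor except Activity Type. The metrics are computed on the same data used to fit the model, so they describe in-sample fit.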
```{r}
reg_spec <- linear_reg() %>% ## Class of problem
set_engine("lm") %>% ## The particular function that we use
set_mode("regression") ## type of model
#Fit the model
reg_fit <- reg_spec %>%
fit(Calories_Burned_Estimated ~ .-Activity_Type,data = df)
#Capture the predictions and metrics
pred_reg_fit <- augment(reg_fit, df)
knitr::kable(tidy(reg_fit$fit),
digits=3)
knitr::kable(pred_reg_fit %>%
metrics(truth=Calories_Burned_Estimated,estimate=.pred),
digits=3)
```
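
Because the metrics above are computed on the training data, a hold-out split gives a more honest estimate of predictive error. The following base-R sketch illustrates the idea on simulated data: the variable names mirror the dashboard, but the values, the coefficients, and the 75/25 split are all assumptions. The helper `eval_metrics()` is hypothetical and recomputes the three quantities that `metrics()` reports for regression (yardstick's `rsq` is the squared correlation between truth and estimate).

```r
set.seed(42)

# Simulated stand-in data; names mirror the dashboard, values are invented
n <- 300
sim <- data.frame(
  Moving_Time_Minutes = runif(n, 10, 120),
  Average_Heart_Rate = rnorm(n, 155, 10)
)
sim$Calories_Burned_Estimated <- 15 * sim$Moving_Time_Minutes +
  2 * sim$Average_Heart_Rate + rnorm(n, sd = 100)

# Hold out 25% of the rows for testing (the proportion is an assumption)
test_idx <- sample(n, size = round(0.25 * n))
train <- sim[-test_idx, ]
test <- sim[test_idx, ]

fit <- lm(Calories_Burned_Estimated ~ ., data = train)

# Hypothetical helper computing RMSE, R-squared, and MAE by hand
eval_metrics <- function(truth, estimate) {
  c(rmse = sqrt(mean((truth - estimate)^2)),
    rsq = cor(truth, estimate)^2,
    mae = mean(abs(truth - estimate)))
}

in_sample <- eval_metrics(train$Calories_Burned_Estimated, predict(fit, train))
out_sample <- eval_metrics(test$Calories_Burned_Estimated, predict(fit, test))
print(round(rbind(in_sample, out_sample), 3))
```

In a tidymodels workflow the same comparison would normally be done with a resampling split rather than by hand. The manual version is shown here only to make the calculation explicit.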
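### Predicting Activity Type
Here is a logistic regression model predicting Activity Type from every predictor except Calories Burned. Every predictor has an extremely high p-value, driven by very large standard errors, while in-sample accuracy is 1. That combination is characteristic of complete separation: the predictors split runs from rides perfectly, so the coefficient estimates are unstable and the p-values say little about which predictors matter.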
```{r}
#Define the model specification
log_spec <- logistic_reg() %>%
set_engine('glm') %>%
set_mode('classification')
#Fit the model
log_fit <- log_spec %>%
fit(Activity_Type ~ .-Calories_Burned_Estimated, data = df)
#Capture the predictions and metrics
my_class_metrics <- metric_set(yardstick::accuracy, yardstick::specificity, yardstick::sensitivity)
pred_log_fit <- augment(log_fit, df)
knitr::kable(tidy(log_fit$fit),
digits=3)
knitr::kable(pred_log_fit %>%
my_class_metrics(truth=Activity_Type,estimate=.pred_class))
```
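
Extremely high p-values alongside perfect accuracy are what logistic regression typically produces when a predictor separates the two classes completely. The toy example below reproduces that pattern with base R's `glm()`. The data are invented, and the 15 km cutoff between runs and rides is an assumption made only so that the classes do not overlap.

```r
# Invented toy data in which distance separates the classes perfectly:
# every activity of 12 km or less is a run (1), every longer one a ride (0).
toy <- data.frame(
  Distance_K = c(3, 4, 5, 5.2, 6, 8, 10, 12, 20, 25, 30, 40, 55, 80),
  Activity_Type = c(1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# glm() warns that fitted probabilities are numerically 0 or 1;
# the warnings are suppressed here only to keep the output short.
sep_fit <- suppressWarnings(
  glm(Activity_Type ~ Distance_K, data = toy, family = binomial)
)

# Coefficient table: estimates, standard errors, z values, p-values
coef_table <- summary(sep_fit)$coefficients
print(round(coef_table, 3))

# Classify at a 0.5 threshold and compute in-sample accuracy
pred_class <- as.integer(fitted(sep_fit) > 0.5)
sep_accuracy <- mean(pred_class == toy$Activity_Type)
print(sep_accuracy)
```

When this happens, the p-values should not be read as evidence that the predictors are unimportant. Penalized approaches such as lasso-regularized logistic regression, or a classification tree, are common alternatives that stay stable under separation.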
Further Data Exploration {data-orientation=rows}
=======================================================================
### Interactive Exploratory Graphs
Row {data-height=550}
-----------------------------------------------------------------------
#### Max Heart Rate & Calories Burned Scatter Plot
Lighter colors indicate longer distances.
```{r}
#plotly is already loaded in the setup chunk
fig <- plot_ly(df, x = ~Calories_Burned_Estimated, y = ~Max_Heart_Rate,
               type = "scatter", mode = "markers", color = ~Distance_K,
               symbol = ~Activity_Type, symbols = c("circle", "cross"))
fig
```
#### Average Distance by Activity Type (Interactive Bar Chart)
```{r}
ggplotly(
df %>% group_by( Activity_Type) %>%
summarize(Average_Distance=mean(Distance_K)) %>%
ggplot(aes(y=Average_Distance, x=Activity_Type,fill=Activity_Type)) +
geom_bar(position="dodge", stat="identity") +
geom_text(aes(label=round(Average_Distance, 2)), position=position_dodge(width=0.9), vjust=-0.25) +
ggtitle("Average Distance by Activity Type") +
coord_flip() #makes horizontal
)
```